ABSTRACT


Multiple  linear  regression  theory  provides  an  estimated  co- 
variance  structure  for  the  estimates  of  the  parameters  of  the 
linear  function  based  on  given  data.  However,  when  the  deviation 
form  is  used  to  calculate  these  parameter  estimates,  the  portions 
of  this  matrix  which  involve  the  constant  term  are  generally 
missing.  This  report  presents  equations  which  can  be  used  to 
calculate  these  missing  covariances  from  quantities  which  are 
generally  available. 

Most  standard  regression  references  discuss  calculation  of 
confidence  limits  for  point  estimates  of  the  dependent  variable 
when  these  point  estimates  are  calculated  from  the  regression 
equation.  This  report  presents  equations  for  similar  limits  for 
the  independent  variables,  again  from  quantities  generally
available  when  a deviation-form  routine  is  used.  A different
interpretation  is  suggested  for  these  limits  than  is  seen  in  the
references.

A numerical  example  is  provided. 

ADMINISTRATIVE  INFORMATION 

This  report  is  a result  of  work  performed  under  Program  Element  60000N, 
Task  Area  OMN,  and  Work  Unit  1-1870-003. 

1.  INTRODUCTION 

Multiple  linear  regression  is  a method  of  determining  a linear 
relationship  between  a dependent  variable  and  a collection  of  independent 
variables.  The  dependent  variable  is  assumed  to  be  equal  to  the  sum  of  a 
linear  function  of  the  independent  variables  and  a random  variable  which 
has  zero  mean  and  unknown  variance.  Multiple  linear  regression  provides 
the  (least  squares)  best  estimates  of  the  parameters  of  the  linear  function 
based  on  given  data.  In  addition,  it  also  can  provide  indications  of  how 
well  the  calculated  linear  function  fits  the  data,  how  much  the  parameter 
estimates  might  vary  from  the  "true"  values,  and  how  much  point  estimates 
obtained  by  using  the  calculated  regression  equation  might  vary  from  the 
"true"  values. 

The  calculations  involved  in  determining  the  parameter  estimates  and 
other  quantities  may  be  provided  in  either  of  two  forms.  The  standard  form 
determines  a constant  term  and  a coefficient  for  each  of  the  independent
variables.  The  deviation  form  replaces  each  data  value  for  each  variable
with  its  deviation  from  the  sample  mean  (for  that  variable)  and  determines
only  the  coefficients.  The  constant  term  is  readily  calculated  from  the
means  of  the  variables  and  the  other  parameter  estimates.  Other  quantities,
although  also  available  by  calculation,  are  generally  ignored  in  a
discussion  of  the  deviation  form.  It  is  the  purpose  of  this  report  to  develop
expressions  for  these  quantities.

Many books and papers have been written on multiple linear regression.
The reader is assumed to have been familiar at one time with the techniques
involved. However, sufficient review is given that this familiarity need
not be recent. Some of the better known facts will be stated without proof
in this report. For background the reader is referred to Acton [1]*, Draper
and Smith [2], or Johnston [3]. Johnston is the primary reference used by
the author.

Multiple linear regression theory provides an estimated covariance
structure for the parameter estimates. However, when the deviation form
is used to calculate these parameter estimates, the portions of this matrix
which involve the constant term are generally missing. This is the case,
for example, when the International Mathematical and Statistical Libraries
(IMSL) [4] routines RLSTEP and RLFORC are used in a stepwise multiple linear
regression application. This report derives equations which can be used to
calculate these missing covariances from quantities which are generally
available: the sum of the squared residuals, the means of the variables,
and the remainder of the estimated covariance matrix.

Most standard regression references discuss calculation of confidence
limits for point estimates of the dependent variable when these point
estimates are calculated from the regression equation. Some (for example,
Acton [1]) also discuss similar limits for similarly obtained point estimates
for the independent variables. However, these discussions generally are
restricted to cases in which there is a single independent variable. Also,
the deviation forms are not considered in this context. Equations are
derived in this report for calculation of such limits in cases in which
there is more than one independent variable and the regression equation
has been calculated using the deviation form. In this report a different
interpretation is suggested for these limits than is seen in the references.

*A complete listing of references is given on page 29.

Section  2 reviews  the  assumptions  involved  in  use  of  a multiple  linear 
regression  procedure.  The  notation  used  throughout  the  report  is  also 
introduced  in  this  section. 

Section  3 provides  a basic  review  of  multiple  linear  regression  and 
states  the  pertinent  equations  used  in  calculating  the  parameter  estimates 
and  other  quantities  when  the  standard  form  is  used. 

Section 4 develops similar equations using the deviation form.
Equations are included for the complete estimated covariance matrix of the
parameter estimates. A familiarity with some simple matrix operations is
necessary for a proper understanding of this section.

Section  5 specifies,  step  by  step,  the  numerical  procedures  to  be 
followed  in  the  application  of  multiple  linear  regression  as  described  in 
Section  4.  In  particular,  the  matrix  equations  of  Section  4 are  rewritten 
in  a nonmatrix  form  which  can  be  used  in  a computer  program.  The  limits 
on  predicted  values  of  an  independent  variable  are  discussed  in  this 
section. 

A numerical  example  is  provided  in  Section  6.  This  example  follows 
the  step-by-step  procedure  given  in  Section  5.  It  is  of  sufficient 
complexity  to  exemplify  each  step  and  yet  of  sufficient  simplicity  to 
allow  hand  calculation. 

A final  section  provides  a summary. 

2.  ASSUMPTIONS  AND  NOTATION 

Given a dependent variable $Y$ and $k - 1$ independent variables
$X_2, X_3, \ldots, X_k$, multiple linear regression assumes a model of the form

$$Y = b_1 + b_2 X_2 + b_3 X_3 + \cdots + b_k X_k + u \qquad (1)$$

where the parameters $b_1, b_2, \ldots, b_k$ are fixed constants and where $u$ is a
random variable with zero mean and constant, but unknown, variance $w^2$.
Generally, $Y$ must be a random variable since $u$ is, but the independent
variables are assumed to be nonrandom. Note that it is not necessary at
this point to assume that the random errors $u$ are normally distributed.

The  assumed  nonrandomness  of  the  independent  variables  means  that  the 
values  of  these  variables  have  been  measured  accurately  and  are  therefore 
known  exactly.  Strictly  speaking,  randomness  in  the  independent  variables 
(for  example,  inaccuracies  in  measuring  and  recording  their  values) 
violates  this  assumption.  However,  this  fact  is  usually  ignored  when  re- 
gression is  applied,  the  prevailing  feeling  being  that  the  (hopefully 
small)  randomness  in  the  independent  variables  can  be  considered  as  a part 
of  the  randomness  in  u.  The  robustness  of  the  method  often  provides  useful 
results  in  such  cases. 

In  general,  column  vectors  are  favored  over  row  vectors  in  this  report. 
The  transpose  of  a vector  or  matrix  is  indicated  by  a prime.  For
convenience,  each  vector  will  be  introduced  in  terms  of  its  transpose.  Thus,
$b$,  the  vector  of  parameters,  is  introduced  by  $b' = (b_1, b_2, \ldots, b_k)$.

The  vector,  all  of  whose  components  are  zero,  is  denoted  by  0.  The 
vector,  all  of  whose  components  are  one,  is  denoted  by  1.  When  used,  the
size  of  these  vectors  is  clear  from  the  context.  Similarly,  the  dimension
of  any  identity  matrix,  I,  is  clear  from  its  use. 

The data are assumed to be arranged in k-tuples, called data points.
The $i$-th data point is $(X_{2i}, X_{3i}, \ldots, X_{ki}, Y_i)$, where $Y_i$ is the $i$-th
observed value of the dependent variable and $X_{ji}$ is the $i$-th observed value
of the $j$-th independent variable. There are assumed to be $n$ data points. The
values in each point are assumed to be associated according to Equation (1),
so that

$$Y_i = b_1 + b_2 X_{2i} + b_3 X_{3i} + \cdots + b_k X_{ki} + u_i \qquad (2)$$

where $u_i$ is some unknown value of the random variable $u$. In matrix form
Equation (2) may be written

$$Y = Xb + u \qquad (3)$$

where $Y' = (Y_1, Y_2, \ldots, Y_n)$, $b' = (b_1, b_2, \ldots, b_k)$, $u' = (u_1, u_2, \ldots, u_n)$, and

$$X = \begin{pmatrix} 1 & X_{21} & X_{31} & \cdots & X_{k1} \\ 1 & X_{22} & X_{32} & \cdots & X_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{2n} & X_{3n} & \cdots & X_{kn} \end{pmatrix} \qquad (4)$$

It  is  further  assumed  that  no  values  are  missing.  That  is,  a data 
point  is  not  used  in  the  regression  unless  all  variables  in  it  have  values. 

The sample mean of a variable is denoted by a bar above the variable
name. Thus,

$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i = 1'Y/n$$

The deviation of a variable from its sample mean is denoted by replacing
the upper case symbol by the corresponding lower case symbol:

$$y_i = Y_i - \bar{Y}$$

$$x_{ji} = X_{ji} - \bar{X}_j$$


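The sample variances, covariances, and correlations are easily expressed
in terms of these deviations. For example:

$$\operatorname{var}(X_j) = \Bigl(\sum_{i=1}^{n} x_{ji}^2\Bigr) / (n-1)$$

$$\operatorname{cov}(X_j, X_h) = \Bigl(\sum_{i=1}^{n} x_{ji} x_{hi}\Bigr) / (n-1)$$

$$\operatorname{corr}(X_j, X_h) = \Bigl(\sum_{i=1}^{n} x_{ji} x_{hi}\Bigr) \Big/ \Bigl[\Bigl(\sum_{i=1}^{n} x_{ji}^2\Bigr)^{1/2} \Bigl(\sum_{i=1}^{n} x_{hi}^2\Bigr)^{1/2}\Bigr]$$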
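The deviation-based definitions above translate directly into a few lines of code. The following is a minimal sketch only; the function names and the four-point data set are hypothetical and are not taken from the report.

```python
import math

def deviations(values):
    """Replace each value by its deviation from the sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def sample_cov(a, b):
    """Sample covariance: sum of deviation products divided by n - 1."""
    da, db = deviations(a), deviations(b)
    return sum(p * q for p, q in zip(da, db)) / (len(a) - 1)

def sample_corr(a, b):
    """Sample correlation expressed through the same deviations."""
    da, db = deviations(a), deviations(b)
    num = sum(p * q for p, q in zip(da, db))
    den = math.sqrt(sum(p * p for p in da)) * math.sqrt(sum(q * q for q in db))
    return num / den

# Hypothetical illustrative data (not from the report).
X2 = [1.0, 2.0, 3.0, 4.0]
X3 = [2.0, 4.0, 5.0, 9.0]

var_x2 = sample_cov(X2, X2)     # a variance is the covariance of a variable with itself
cov_23 = sample_cov(X2, X3)
corr_23 = sample_corr(X2, X3)
```

Note that the divisor n - 1 cancels in the correlation, which is why only the sums of deviation products appear there.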


For each $j = 1, 2, \ldots, k$, multiple linear regression finds an estimate
$B_j$ for the value of the parameter $b_j$ in Equation (1). The result,
then, is the regression equation:

$$Y = B_1 + B_2 X_2 + B_3 X_3 + \cdots + B_k X_k \qquad (5)$$

which can be used to arrive at an estimated value for any variable given
values for the others. (Caution: No guarantee has yet been given concerning
the accuracy of such estimates.) In particular, for $i = 1, 2, \ldots, n$,

$$\hat{Y}_i = B_1 + B_2 X_{2i} + B_3 X_{3i} + \cdots + B_k X_{ki} \qquad (6)$$

is the estimated, or predicted, value of $Y$ in the $i$-th data point. This
value can be compared to the actual value, $Y_i$. The difference

$$e_i = Y_i - \hat{Y}_i \qquad (7)$$

is called the residual and represents what must be added to the predicted
value to get the actual value. In matrix form the vector of residuals,
$e' = (e_1, e_2, \ldots, e_n)$, is given by

$$e = Y - \hat{Y} = Y - XB \qquad (8)$$

Note from Equation (7) that $e_i$ is an estimate of the value of $u_i$. This
fact is used in the next section to arrive at an estimate for $w^2$.

The matrix Equations (3) and (8) provide a starting point for an
elegant development of the multiple linear regression equations given
without proof in the next section. The interested reader is referred to
Johnston [3], Chapter 5.

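In addition to the data matrix in standard form given in Equation (4),
the data matrix in deviation form:

$$x = \begin{pmatrix} x_{21} & x_{31} & \cdots & x_{k1} \\ x_{22} & x_{32} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ x_{2n} & x_{3n} & \cdots & x_{kn} \end{pmatrix} \qquad (9)$$

will also be used.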

In addition to the assumptions given earlier in this section, it is
necessary to assume that the random variable $u$ is normally distributed in
order to derive confidence intervals. For other results, however, this
assumption is not needed. On the other hand, it is always necessary to
assume independence of $u_i$ and $u_h$ for $i$ not equal to $h$. Relaxation of many
of the assumptions stated here is considered in Johnston [3].

3.  STANDARD  REGRESSION  EQUATIONS 

In  this  section  the  standard  multiple  linear  regression  equations  are 
presented  in  matrix  form.  Since  development  of  these  equations  is  con- 
tained in  any  good  book  on  regression  analysis,  it  is  not  repeated  here. 
Recall from the previous section that $X$ is the data matrix in standard
form, $Y$ is the vector of sample values of the dependent variable, $b$ is the
vector of regression parameters, $B$ is the vector of parameter estimates,
$w^2$ is the variance of the random variable $u$ (see Equation (1)), and $e$ is
the vector of residuals. Also used in this section is the vector of
deviations of the value of the dependent variable about its mean:
$y' = (y_1, y_2, \ldots, y_n)$.

The objective of multiple linear regression is to determine the vector
$B$ which will minimize $e'e$, the sum of the squared residuals. The vector $B$
which accomplishes this objective is given by

$$B = (X'X)^{-1} X'Y \qquad (10)$$

The matrix $X'X$ is a square symmetric matrix called the information matrix.

Note that each $B_j$ is a linear combination of the random variables
$Y_1, Y_2, \ldots, Y_n$. (The elements of $X$ were assumed to be nonrandom.) From the
assumptions imposed upon $u$, it is determined that

$$E(B) = b \qquad (11)$$

$$\operatorname{var}(B) = w^2 (X'X)^{-1} \qquad (12)$$


So $B$ is an unbiased estimator of $b$ (that is, $B_j$ is an unbiased
estimator of $b_j$ for each $j = 1, 2, \ldots, k$). The Gauss-Markov Theorem on least
squares indicates that $B$ is the best linear unbiased estimator of $b$.
Although Equation (12) gives the covariance structure of $B$, $w^2$ is unknown and
$\operatorname{var}(B)$ cannot actually be determined. However, with the choice of $B$ given
in Equation (10), the expected value of the sum of the squared residuals
is given by

$$E(e'e) = (n-k)\,w^2 \qquad (13)$$

Thus, $w^2$ can be approximated by

$$\hat{w}^2 = v^2 = (e'e)/(n-k) \qquad (14)$$

When this approximation is used in Equation (12), an estimated covariance
matrix for $B$ is found to be

$$\operatorname{var}(B) = v^2 (X'X)^{-1} = (e'e)(X'X)^{-1}/(n-k) \qquad (15)$$

The squared coefficient of multiple correlation, $R^2$, often used as a measure
of the goodness of fit of the regression equation to the given data, is
calculated from

$$R^2 = 1 - (e'e)/(y'y) \qquad (16)$$

The regression equation, Equation (5), can be used to determine a
predicted value for the dependent variable given values for the independent
variables. On the assumption that $u$ (and therefore $Y$ and $B$) is normally
distributed, confidence limits may be placed around this prediction. For
example, suppose that it is desired to determine such confidence limits for
$Y$ when $X_j = Z_j$ for $j = 2, 3, \ldots, k$. For $Z' = (1, Z_2, Z_3, \ldots, Z_k)$,
$\hat{Y} = Z'B$ is normally distributed and

$$E(\hat{Y}) = Z'b \qquad (17)$$

$$\operatorname{var}(E(Y)) = w^2\, Z'(X'X)^{-1}Z \qquad (18)$$

$$\operatorname{var}(Y) = w^2 \bigl(1 + Z'(X'X)^{-1}Z\bigr) \qquad (19)$$

However, $e'e/w^2$ has a chi-squared distribution with $n-k$ degrees of freedom
and is independent of $Z'B$. The confidence limits are found, by shifting
in the usual way to a Student's t distribution, to be, for $E(Y)$,

$$Z'B \pm t\,v \sqrt{Z'(X'X)^{-1}Z} \qquad (20)$$

and, for $Y$,

$$Z'B \pm t\,v \sqrt{1 + Z'(X'X)^{-1}Z} \qquad (21)$$

where $t$ is a critical point from a t distribution with $n-k$ degrees of
freedom. If $r$ is the desired significance level, then $t$ is the $(1 - r/2)$
critical point. (For example, for 90 percent confidence limits the
significance level is $r = 0.1$, and $t$ would be found in the 0.95 column of the
table of t distribution points.)

The  regression  equation  can  also  be  used  to  calculate  a predicted 
value  for  one  of  the  independent  variables  given  values  for  the  dependent 
variable  and  each  of  the  other  independent  variables.  Limits,  similar  to
confidence  limits,  can  also  be  calculated  in  this  case.  These  limits  will
be  discussed  in  Section  5. 
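The standard-form calculation of Equations (10), (14), and (15) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the report's program: the helper `invert` is a hypothetical Gauss-Jordan routine, and the thirteen observations are the cement data widely reproduced from Draper and Smith, assumed here to be the data of the report's Table 1 (their sums agree with the Step 1 totals quoted in Section 6).

```python
# Assumed data: taken to correspond to the report's Table 1 (see lead-in).
X2 = [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10]
X3 = [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68]
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
     72.5, 93.1, 115.9, 83.8, 113.3, 109.4]

def invert(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

rows = [[1.0, a, b] for a, b in zip(X2, X3)]   # data matrix X of Equation (4)
n, k = len(Y), 3

XtX = [[sum(r[s] * r[t] for r in rows) for t in range(k)] for s in range(k)]
XtY = [sum(r[s] * y for r, y in zip(rows, Y)) for s in range(k)]
XtX_inv = invert(XtX)

B = [sum(XtX_inv[s][t] * XtY[t] for t in range(k)) for s in range(k)]    # Equation (10)
resid = [y - sum(b * x for b, x in zip(B, r)) for r, y in zip(rows, Y)]  # Equation (7)
v2 = sum(e * e for e in resid) / (n - k)                                 # Equation (14)
var_B = [[v2 * XtX_inv[s][t] for t in range(k)] for s in range(k)]       # Equation (15)
```

Under these assumptions `B`, `v2`, and `var_B` agree with the values reported in Steps 3 through 5 of Section 6, which were obtained there by the deviation form.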

4.  REGRESSION  EQUATIONS  IN  DEVIATION  FORM 

In this section the deviation form equivalents for Equations (10),
(12), (15), (17), (18), (19), (20), and (21) are presented. As noted in
the introduction, the deviation form equivalent of Equation (15) does not
contain the portion of the estimated covariance matrix of the parameter
estimates which deals with the constant term. Equations for these
covariances are derived in this section.

First, consider the structure of $X'X$ and $x'x$. From Equation (4), the
definition of $X$, note that

$$X'X = \begin{pmatrix} n & n\bar{X}' \\ n\bar{X} & S \end{pmatrix} \qquad (22)$$

where $\bar{X}' = (\bar{X}_2, \bar{X}_3, \ldots, \bar{X}_k)$ and $S$ is the $(k-1) \times (k-1)$ matrix having at the
intersection of its $(s-1)$-th row and its $(t-1)$-th column the element

$$S_{s-1,\,t-1} = \sum_{i=1}^{n} X_{si} X_{ti} \qquad (23)$$

From Equation (9), the definition of $x$, it is seen that the element of $x'x$ at
the intersection of the $(s-1)$-th row and the $(t-1)$-th column is given by

$$(x'x)_{s-1,\,t-1} = \sum_{i=1}^{n} x_{si} x_{ti} = \sum_{i=1}^{n} X_{si} X_{ti} - n \bar{X}_s \bar{X}_t \qquad (24)$$

From a comparison of Equations (23) and (24), the matrix form for
Equation (24) is found to be

$$x'x = S - n \bar{X} \bar{X}' \qquad (25)$$

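The relationship between the inverses of $X'X$ and $x'x$ is clarified by
an examination of the matrix

$$Q = \begin{pmatrix} 1/n & 0' \\ -\bar{X} & I \end{pmatrix} \qquad (26)$$

Then,

$$Q(X'X) = \begin{pmatrix} 1 & \bar{X}' \\ 0 & x'x \end{pmatrix} \qquad (27)$$

Inversion of Equation (27) shows that

$$(X'X)^{-1} Q^{-1} = \begin{pmatrix} 1 & -\bar{X}'(x'x)^{-1} \\ 0 & (x'x)^{-1} \end{pmatrix} \qquad (28)$$

When both sides of this last equation are multiplied on the right by $Q$ as
defined in Equation (26),

$$(X'X)^{-1} = \begin{pmatrix} 1/n + \bar{X}'(x'x)^{-1}\bar{X} & -\bar{X}'(x'x)^{-1} \\ -(x'x)^{-1}\bar{X} & (x'x)^{-1} \end{pmatrix} \qquad (29)$$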

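Consideration is next given to the form of $X'Y$ and $x'y$. If $X_{\cdot}$ is the
submatrix of $X$ found by eliminating the first column of $X$ (consisting of
all ones), it is found that

$$x'y = X_{\cdot}'Y - n\bar{Y}\bar{X} \qquad (30)$$

This is the matrix equivalent of

$$\sum_{i=1}^{n} x_{ji} y_i = \sum_{i=1}^{n} X_{ji} Y_i - n \bar{X}_j \bar{Y} \qquad (31)$$

It is also found that

$$X'Y = \begin{pmatrix} n\bar{Y} \\ X_{\cdot}'Y \end{pmatrix} \qquad (32)$$

The deviation form equivalent of Equation (10) is

$$B_{\cdot} = (x'x)^{-1} x'y \qquad (33)$$

From Equations (10), (29), (30), (32), and (33), it is determined that

$$B = \begin{pmatrix} \bar{Y} + n\bar{Y}\bar{X}'(x'x)^{-1}\bar{X} - \bar{X}'(x'x)^{-1}X_{\cdot}'Y \\ -n\bar{Y}(x'x)^{-1}\bar{X} + (x'x)^{-1}X_{\cdot}'Y \end{pmatrix} = \begin{pmatrix} \bar{Y} - \bar{X}'B_{\cdot} \\ B_{\cdot} \end{pmatrix} \qquad (34)$$

That is, $B_{\cdot}$ is the subvector of $B$ found by eliminating $B_1$. Furthermore,
$B_1$ may also be found from Equation (34):

$$B_1 = \bar{Y} - \bar{X}'B_{\cdot} \qquad (35)$$

Note that Equations (15) and (29) yield the estimated covariance matrix
for $B$ in terms of $x'x$ and the sample means:

$$\operatorname{var}(B_{\cdot}) = \hat{w}^2 (x'x)^{-1} = (e'e)(x'x)^{-1}/(n-k) \qquad (36)$$

$$\operatorname{cov}(B_{\cdot}, B_1) = -\hat{w}^2\, \bar{X}'(x'x)^{-1} = -(e'e)\,\bar{X}'(x'x)^{-1}/(n-k) \qquad (37)$$

$$\operatorname{var}(B_1) = \hat{w}^2 \bigl(1/n + \bar{X}'(x'x)^{-1}\bar{X}\bigr) = (e'e)\bigl(1/n + \bar{X}'(x'x)^{-1}\bar{X}\bigr)/(n-k) \qquad (38)$$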

Finally, consideration is given to estimation in the deviation context.
Since all the elements of $B$ may be calculated from Equations (33) and (35),
point estimates may be calculated from the regression equation,
Equation (5), or from the deviation form of Equation (5), which is:

$$y = B_2 x_2 + B_3 x_3 + \cdots + B_k x_k \qquad (39)$$

As in the preceding section, suppose that confidence limits for $Y$ are
desired when $X_j = Z_j$, for $j = 2, 3, \ldots, k$. If $Z_{\cdot}' = (Z_2, Z_3, \ldots, Z_k)$ and
$z' = Z_{\cdot}' - \bar{X}' = (z_2, \ldots, z_k)$, where $z_j = Z_j - \bar{X}_j$, for each $j$, then

$$z'(x'x)^{-1}z = Z_{\cdot}'(x'x)^{-1}Z_{\cdot} - Z_{\cdot}'(x'x)^{-1}\bar{X} - \bar{X}'(x'x)^{-1}Z_{\cdot} + \bar{X}'(x'x)^{-1}\bar{X} \qquad (40)$$

Thus, Equation (29) implies that

$$Z'(X'X)^{-1}Z = 1/n + z'(x'x)^{-1}z \qquad (41)$$

When this result is used in Equations (18) through (21), the deviation
equivalents are:

$$\operatorname{var}(E(y)) = w^2 \bigl(1/n + z'(x'x)^{-1}z\bigr) \qquad (42)$$

$$\operatorname{var}(y) = w^2 \bigl[(n+1)/n + z'(x'x)^{-1}z\bigr] \qquad (43)$$

$$z'B_{\cdot} \pm t\,v \sqrt{1/n + z'(x'x)^{-1}z} \qquad (44)$$

$$z'B_{\cdot} \pm t\,v \sqrt{(n+1)/n + z'(x'x)^{-1}z} \qquad (45)$$

where Equations (44) and (45) are confidence limits for $E(y)$ and $y$,
respectively. As in Section 3, $v$ is as given in Equation (14) and $t$ is a
critical point from a t distribution with $n-k$ degrees of freedom. The
confidence limits for $Y$ may be obtained from Equation (45) by adding the
mean $\bar{Y}$ to the limits for $y$. Equivalent to Equation (17) is

$$E(y) = z'b_{\cdot} \qquad (46)$$

where $b_{\cdot}' = (b_2, b_3, \ldots, b_k)$.

In  the  next  section  some  of  the  matrix  equations  of  this  section  are 
put  into  a form  more  amenable  to  numerical  calculation. 

5.  NUMERICAL  PROCEDURES 

Suppose  a set  of  n data  points  is  available  for  performance  of  a 
multiple  linear  regression.  This  section  sets  forth  the  steps  of  the 
numerical  procedure  to  be  used  in  accomplishing  this  regression  and 
determining  the  other  information  discussed  in  earlier  sections.  The 
deviation  form  equations  of  the  preceding  section  are  used  here. 

The first step is to develop the matrix $x'x$ and the vector $x'y$. This
may be accomplished with a single pass through the data by accumulating

(a) $\sum_{i=1}^{n} X_{ji}$

(b) $\sum_{i=1}^{n} Y_i$

(c) $\sum_{i=1}^{n} X_{ji} X_{mi}$

(d) $\sum_{i=1}^{n} X_{ji} Y_i$

for each $j, m = 2, 3, \ldots, k$. The means $\bar{X}_j$ and $\bar{Y}$ are found by dividing (a)
and (b), respectively, by $n$. These means are then used with (c) to
determine the elements of $x'x$ from Equation (24), and with (d) to determine
the elements of $x'y$ from

$$(x'y)_{s-1} = \sum_{i=1}^{n} x_{si} y_i = \sum_{i=1}^{n} X_{si} Y_i - n \bar{X}_s \bar{Y} \qquad (47)$$

So that $R^2$ may be calculated from Equation (16), $\sum_{i=1}^{n} Y_i^2$ should also be
accumulated and used to determine $y'y$ from

$$y'y = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 \qquad (48)$$

Since the means will be required later, they should be retained. Dorn and
McCracken indicate in Sections 3.5 and 7.8 that this single pass method
may not be as accurate numerically as a two-pass method in which (a) and
(b) are accumulated on the first pass and the sums (over $i$) of $x_{ji} x_{mi}$,
$x_{ji} y_i$, and $y_i^2$ on the second pass.
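The single-pass and two-pass accumulations described above can be compared side by side. The following is a sketch only: the function names are hypothetical, and the data are the thirteen observations assumed to correspond to the report's Table 1 (the cement data reproduced from Draper and Smith, whose sums match the Step 1 totals of Section 6).

```python
def one_pass(cols, y):
    """Accumulate raw sums (a)-(d) in one sweep, then correct with the means
    as in Equations (24), (47), and (48)."""
    n, p = len(y), len(cols)
    sx = [0.0] * p
    sxy = [0.0] * p
    sxx = [[0.0] * p for _ in range(p)]
    sy = syy = 0.0
    for i in range(n):
        row = [col[i] for col in cols]
        sy += y[i]
        syy += y[i] * y[i]
        for s in range(p):
            sx[s] += row[s]
            sxy[s] += row[s] * y[i]
            for t in range(p):
                sxx[s][t] += row[s] * row[t]
    xbar = [v / n for v in sx]
    ybar = sy / n
    xtx = [[sxx[s][t] - n * xbar[s] * xbar[t] for t in range(p)] for s in range(p)]
    xty = [sxy[s] - n * xbar[s] * ybar for s in range(p)]
    yty = syy - n * ybar * ybar
    return xbar, ybar, xtx, xty, yty

def two_pass(cols, y):
    """First pass: means. Second pass: sums of products of deviations."""
    n, p = len(y), len(cols)
    xbar = [sum(col) / n for col in cols]
    ybar = sum(y) / n
    dx = [[v - m for v in col] for col, m in zip(cols, xbar)]
    dy = [v - ybar for v in y]
    xtx = [[sum(a * b for a, b in zip(dx[s], dx[t])) for t in range(p)] for s in range(p)]
    xty = [sum(a * b for a, b in zip(dx[s], dy)) for s in range(p)]
    yty = sum(b * b for b in dy)
    return xbar, ybar, xtx, xty, yty

# Assumed data: taken to correspond to the report's Table 1 (see lead-in).
X2 = [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10]
X3 = [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68]
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
     72.5, 93.1, 115.9, 83.8, 113.3, 109.4]

single = one_pass([X2, X3], Y)
double = two_pass([X2, X3], Y)
```

For data of this size and scale the two methods agree closely; the single-pass form loses accuracy mainly when the means are large relative to the spread of the data, because the correction terms then subtract nearly equal quantities.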

The second step is to invert $x'x$. Most modern computers have
reasonably accurate matrix inversion routines available. For convenience
the element of $(x'x)^{-1}$ at the intersection of row $s$ and column $t$ will be
denoted by $c_{s+1,\,t+1}$. That is,

$$(x'x)^{-1} = \begin{pmatrix} c_{22} & c_{23} & \cdots & c_{2k} \\ c_{32} & c_{33} & \cdots & c_{3k} \\ \vdots & \vdots & & \vdots \\ c_{k2} & c_{k3} & \cdots & c_{kk} \end{pmatrix} \qquad (49)$$


The third step is to determine the estimates of the regression
parameters. For $j = 2, 3, \ldots, k$, these estimates are calculated (see
Equation (33)) from

$$B_j = \sum_{m=2}^{k} c_{jm} \Bigl(\sum_{i=1}^{n} x_{mi} y_i\Bigr) \qquad (50)$$

The value of $B_1$ may then be calculated (see Equation (35)) from

$$B_1 = \bar{Y} - \sum_{j=2}^{k} B_j \bar{X}_j \qquad (51)$$

The regression equation, Equation (5), is now known and can be used to
arrive at a point estimate for $Y$, given values $X_2 = Z_2$, $X_3 = Z_3$, ...,
$X_k = Z_k$ for the independent variables. It can also be used to arrive
at a point estimate for independent variable $X_m$ given $Y = Y'$ and, for
$j = 2, 3, \ldots, m-1, m+1, \ldots, k$, $X_j = Z_j$, from

$$X_m = \Bigl[\,Y' - B_1 - {\sum_{j=2}^{k}}' B_j Z_j\Bigr] \Big/ B_m \qquad (52)$$

where $\sum'$ refers to the sum exclusive of the $m$-th term.
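Steps 2 and 3, together with the point estimates of Equations (5) and (52), can be sketched as follows for the two-variable case. The inputs are the values of $x'x$, $x'y$, and the sums quoted in Step 1 of Section 6; the function names and the zero-based indexing (index 0 for $X_2$, index 1 for $X_3$) are conventions of this sketch, not of the report.

```python
n = 13
xbar = [97.0 / n, 626.0 / n]          # means of X_2 and X_3
ybar = 1240.5 / n                      # mean of Y
xtx = [[415.230769, 251.076923], [251.076923, 2905.692310]]
xty = [775.961538, 2292.953850]

# Step 2: invert x'x (closed form for the 2x2 case)
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
c = [[xtx[1][1] / det, -xtx[0][1] / det],
     [-xtx[1][0] / det, xtx[0][0] / det]]

# Step 3: Equation (50) for B_2, B_3 and Equation (51) for B_1
B_dev = [sum(c[j][m] * xty[m] for m in range(2)) for j in range(2)]
B1 = ybar - sum(b * m for b, m in zip(B_dev, xbar))

def predict_y(z):
    """Point estimate of Y from Equation (5) at X_j = z[j]."""
    return B1 + sum(b * v for b, v in zip(B_dev, z))

def solve_for_x(m, y_target, z):
    """Equation (52): point estimate of the independent variable at index m,
    given Y = y_target and the other variables in z (z[m] is ignored)."""
    others = sum(b * v for j, (b, v) in enumerate(zip(B_dev, z)) if j != m)
    return (y_target - B1 - others) / B_dev[m]

y_hat = predict_y([6.0, 50.0])
x2_hat = solve_for_x(0, 120.0, [0.0, 50.0])
x3_hat = solve_for_x(1, 120.0, [6.0, 0.0])
```

With these inputs the sketch reproduces the coefficients and point estimates quoted in Section 6 to the precision of the rounded inputs.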

The next step is to calculate $R^2$ from Equation (16). Equations (5)
and (7) may be used and a pass may be made through the data to calculate
the residual associated with each data point. The sum of the squared
residuals can then be accumulated and used in Equation (16) to calculate
$R^2$. An alternative procedure, which does not necessitate calculation of
the individual residuals, is based on the fact (see Johnston [3],
Equation (5.22)) that

$$e'e = y'y - B_{\cdot}'x'y \qquad (53)$$

In addition, Equation (53) can be used in Equation (14) to provide the
following estimate of $w^2$, the variance of the dependent variable:

$$\hat{w}^2 = v^2 = \Bigl[\,y'y - \sum_{j=2}^{k} B_j \Bigl(\sum_{i=1}^{n} x_{ji} y_i\Bigr)\Bigr] \Big/ (n-k) \qquad (55)$$
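The shortcut of Equation (53) needs only quantities already in hand after Step 3. A minimal sketch, using the rounded values quoted in Section 6 as inputs (so the results agree with the report only to about the precision of those inputs):

```python
n, k = 13, 3
yty = 2715.7631                      # y'y from Equation (48)
xty = [775.961538, 2292.953850]      # x'y from Equation (47)
B_dev = [1.468306, 0.6622505]        # B_2 and B_3 from Equation (50)

ete = yty - sum(b * s for b, s in zip(B_dev, xty))   # Equation (53)
r_squared = 1.0 - ete / yty                          # Equation (16)
v2 = ete / (n - k)                                   # Equations (14) and (55)
```

Because `ete` is a small difference of two large numbers, it is sensitive to rounding in the coefficients; carrying full precision through Step 3 is advisable in an actual program.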


The fifth step is to calculate the estimated covariance structure of
the parameter estimates from Equations (36), (37), and (38). From
Equation (38) the estimated variance of $B_1$ is seen to be

$$\operatorname{var}(B_1) = v^2 \Bigl(1/n + \sum_{s=2}^{k} \sum_{t=2}^{k} c_{st} \bar{X}_s \bar{X}_t\Bigr) \qquad (56)$$

For $j = 2, 3, \ldots, k$, Equation (37) indicates that

$$\operatorname{cov}(B_j, B_1) = -v^2 \sum_{s=2}^{k} c_{js} \bar{X}_s \qquad (57)$$

Finally, for $j, m = 2, 3, \ldots, k$, Equation (36) yields

$$\operatorname{cov}(B_j, B_m) = v^2 c_{jm} \qquad (58)$$
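Equations (56) through (58) recover the complete covariance structure, including the row and column for the constant term, from $v^2$, the means, and $(x'x)^{-1}$. A short sketch using the values quoted in Section 6 (index 0 stands for $X_2$ and index 1 for $X_3$, a convention of this sketch only):

```python
n = 13
v2 = 5.790450
xbar = [7.461538, 48.153846]
c = [[0.002541066, -0.0002195701],
     [-0.0002195701, 0.0003631248]]      # elements of (x'x)^-1, Equation (49)

p = len(xbar)

# Equation (56): estimated variance of the constant term B_1
var_B1 = v2 * (1.0 / n + sum(c[s][t] * xbar[s] * xbar[t]
                             for s in range(p) for t in range(p)))

# Equation (57): covariances between each coefficient and B_1
cov_Bj_B1 = [-v2 * sum(c[j][s] * xbar[s] for s in range(p)) for j in range(p)]

# Equation (58): covariances among the coefficients
cov_B = [[v2 * c[j][m] for m in range(p)] for j in range(p)]
```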


For any given set of values, $X_2 = Z_2$, $X_3 = Z_3$, ..., $X_k = Z_k$,
confidence limits at various levels of significance can be calculated for
the dependent variable using Equation (45). For $100(1-r)$ percent
confidence limits use

$$B_1 + \sum_{j=2}^{k} B_j Z_j \pm t\,v \sqrt{(n+1)/n + \sum_{s=2}^{k} \sum_{t=2}^{k} c_{st} (Z_s - \bar{X}_s)(Z_t - \bar{X}_t)} \qquad (59)$$

where $t$ is the $(1 - r/2)$ critical point from a t distribution with $n-k$
degrees of freedom. Note that the point estimate of the dependent variable
given by Equation (5) is midway between these limits and that the length of
the interval bounded by these limits is $2t$ times the square root of the
estimated variance of the point estimate (see Equation (43)).
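Equation (59) can be evaluated with a few lines of code once Steps 1 through 4 are complete. In the sketch below the inputs are the values quoted in Section 6, the critical point `t` is supplied by the caller (for example, from a t table), and the function name `y_limits` is hypothetical.

```python
import math

n = 13
v2 = 5.790450
B1, B_dev = 52.57735, [1.468306, 0.6622505]
xbar = [7.461538, 48.153846]
c = [[0.002541066, -0.0002195701],
     [-0.0002195701, 0.0003631248]]

def y_limits(z, t):
    """Lower and upper limits of Equation (59) for Y at X_j = z[j]."""
    p = len(z)
    d = [z[j] - xbar[j] for j in range(p)]            # Z_j minus the mean of X_j
    quad = sum(c[s][u] * d[s] * d[u] for s in range(p) for u in range(p))
    centre = B1 + sum(b * v for b, v in zip(B_dev, z))
    half = t * math.sqrt(v2) * math.sqrt((n + 1) / n + quad)
    return centre - half, centre + half

# 95 percent limits at X_2 = 6, X_3 = 50, with t = 2.228 (ten degrees of freedom)
low, high = y_limits([6.0, 50.0], 2.228)
```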

Converting  Equation  (59)  to  a probability  statement  produces  a method 
of  calculating  limits,  similar  to  confidence  limits,  for  the  predicted 
value  of  an  independent  variable,  given  values  for  the  dependent  variable 
and  each  of  the  other  independent  variables.  Since  the  independent 
variables  are  not  random  (see  Section  2),  these  limits  are  not  truly 
confidence  limits.  However,  they  do  have  a potentially  useful  inter- 
pretation as  a specification  of  the  range  of  values  for  the  independent 
variable  for  which  a particular  value  lies  within  a confidence  interval  for 
the  dependent  variable. 

Suppose that values have been specified for $X_2, X_3, \ldots, X_{m-1}$,
$X_{m+1}, \ldots, X_k$ as above. Suppose further that it is desired to find the
values of $X_m$ for which $Y = Y'$ will lie in a $100(1-r)$ percent confidence
interval for (the random variable) $Y$. For each value $Z_m$ of $X_m$,

$$\Pr\Bigl\{\,\Bigl|\,Y - B_1 - \sum_{j=2}^{k} B_j Z_j\Bigr| \le t\,v \sqrt{(n+1)/n + \sum_{s=2}^{k} \sum_{t=2}^{k} c_{st} (Z_s - \bar{X}_s)(Z_t - \bar{X}_t)}\,\Bigr\} = 1 - r \qquad (60)$$

If the inequality within the braces in Equation (60) is solved for the
unknown $Z_m$ with $Y = Y'$, the values of $X_m$ for which $Y = Y'$ lies in a
$100(1-r)$ percent confidence interval for $Y$ may be determined.

The result of such an algebraic procedure is

$$f Z_m^2 + g Z_m + h = 0 \qquad (61)$$

where

$$f = B_m^2 - t^2 v^2 c_{mm}$$

$$g = -2 t^2 v^2 \Bigl[\,{\sum_{j=2}^{k}}' c_{jm} (Z_j - \bar{X}_j) - c_{mm} \bar{X}_m\Bigr] - 2 B_m \Bigl[\,Y' - B_1 - {\sum_{j=2}^{k}}' B_j Z_j\Bigr]$$

$$h = \Bigl[\,Y' - B_1 - {\sum_{j=2}^{k}}' B_j Z_j\Bigr]^2 - t^2 v^2 \Bigl[(n+1)/n + {\sum_{s=2}^{k}}' {\sum_{t=2}^{k}}' c_{st} (Z_s - \bar{X}_s)(Z_t - \bar{X}_t) - 2 \bar{X}_m {\sum_{j=2}^{k}}' c_{mj} (Z_j - \bar{X}_j) + c_{mm} \bar{X}_m^2\Bigr]$$

and primed sums exclude the $m$-th term, as in Equation (52).

The zeros of the function on the left side of Equation (61) can be
determined using the quadratic formula. If this function has no real
zeros, $Y = Y'$ lies in the confidence interval for all values of $X_m$
(provided $X_2 = Z_2, \ldots, X_{m-1} = Z_{m-1}, X_{m+1} = Z_{m+1}, \ldots, X_k = Z_k$). Otherwise,
the zeros may be denoted by $Z_\ell$ and $Z_u$, $Z_u$ being the larger of the two
values. Then Equation (61) may be used to determine the values of $X_m$ for
which $Y = Y'$ lies in the confidence interval. These values are all those
between $Z_\ell$ and $Z_u$, if $X_m$ from Equation (52) is between $Z_\ell$ and $Z_u$, and all
values less than or equal to $Z_\ell$ or larger than or equal to $Z_u$ otherwise.
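The coefficients of Equation (61) and its zeros can be computed as sketched below. The inputs are the values quoted in Section 6; the function name `x_limits`, the zero-based index `m` (0 for $X_2$, 1 for $X_3$), and the return convention are assumptions of this sketch. It assumes f is not zero and does not decide which side of the zeros applies; that decision follows the rule stated above.

```python
import math

n = 13
v2 = 5.790450
B1, B_dev = 52.57735, [1.468306, 0.6622505]
xbar = [7.461538, 48.153846]
c = [[0.002541066, -0.0002195701],
     [-0.0002195701, 0.0003631248]]

def x_limits(m, y_target, z, t):
    """Zeros (smaller, larger) of Equation (61) for the variable at index m.
    z[m] is ignored. Returns None when there are no real zeros."""
    p = len(z)
    others = [j for j in range(p) if j != m]           # primed sums skip index m
    d = {j: z[j] - xbar[j] for j in others}
    resid = y_target - B1 - sum(B_dev[j] * z[j] for j in others)
    cross = sum(c[j][m] * d[j] for j in others)
    quad = sum(c[s][u] * d[s] * d[u] for s in others for u in others)
    tv2 = t * t * v2
    f = B_dev[m] ** 2 - tv2 * c[m][m]
    g = -2.0 * tv2 * (cross - c[m][m] * xbar[m]) - 2.0 * B_dev[m] * resid
    h = resid ** 2 - tv2 * ((n + 1) / n + quad
                            - 2.0 * xbar[m] * cross + c[m][m] * xbar[m] ** 2)
    disc = g * g - 4.0 * f * h
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    return tuple(sorted(((-g - root) / (2.0 * f), (-g + root) / (2.0 * f))))

# Limits on X_2 for Y' = 120 with X_3 = 50, and on X_3 with X_2 = 6 (t = 2.228)
x2_low, x2_high = x_limits(0, 120.0, [0.0, 50.0], 2.228)
x3_low, x3_high = x_limits(1, 120.0, [6.0, 0.0], 2.228)
```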

It  should  be  remembered  that  the  last  few  paragraphs  depend  on  the 
assumption  that  the  random  variable  u (and  therefore  the  dependent 
variable)  has  a normal  distribution,  as  well  as  the  other  assumptions 
stated  in  Section  2.  Such  normality  assumptions  imply  normality  of  the 
residuals.  If  the  residual  is  calculated  for  each  data  point,  a standard 
normality  test  will  indicate  how  valid  such  an  assumption  may  be. 

6.  A NUMERICAL  EXAMPLE 

The example discussed in this section is discussed in some detail by
Draper and Smith [2], although they do not cover some of the details
considered here. The development presented here follows the steps of the
preceding section.

The data consist of $n = 13$ data points, each point consisting of a
value of a dependent variable, $Y$, and each of two independent variables,
$X_2$ and $X_3$. The data are listed in Table 1. Note, for example, that
$X_{2,4} = 11$, $X_{3,6} = 55$, and $Y_7 = 102.7$.
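Step 1:

$$\sum_{i=1}^{13} X_{2i} = 97 \qquad \bar{X}_2 = 7.461538 \qquad \sum_{i=1}^{13} X_{2i}^2 = 1139$$

$$\sum_{i=1}^{13} X_{3i} = 626 \qquad \bar{X}_3 = 48.153846 \qquad \sum_{i=1}^{13} X_{3i}^2 = 33050$$

$$\sum_{i=1}^{13} Y_i = 1240.5 \qquad \bar{Y} = 95.423077$$

$$\sum_{i=1}^{13} X_{2i} X_{3i} = 4922 \qquad \sum_{i=1}^{13} Y_i^2 = 121088.09$$

$$\sum_{i=1}^{13} X_{2i} Y_i = 10032 \qquad \sum_{i=1}^{13} X_{3i} Y_i = 62027.8$$

$$x'x = \begin{pmatrix} 415.230769 & 251.076923 \\ 251.076923 & 2905.692310 \end{pmatrix} \qquad x'y = \begin{pmatrix} 775.961538 \\ 2292.953850 \end{pmatrix} \qquad y'y = 2715.7631$$

Step 2:

$$(x'x)^{-1} = \begin{pmatrix} 0.002541066 & -0.0002195701 \\ -0.0002195701 & 0.0003631248 \end{pmatrix}$$

Step 3:

$$B_2 = 1.468306 \qquad B_3 = 0.6622505 \qquad B_1 = 52.57735$$

The regression equation is

$$Y = 52.57735 + 1.468306\,X_2 + 0.6622505\,X_3$$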


Step 4:

$$e'e = 57.9045 \qquad R^2 = 0.9786784 \qquad \hat{w}^2 = v^2 = 5.790450$$

Step 5:

    var(B1)     =  5.226595
    cov(B1,B2)  = -0.04856520
    cov(B1,B3)  = -0.09176431
    var(B2)     =  0.01471392
    cov(B2,B3)  = -0.001271410
    var(B3)     =  0.002102656

Step 6: Confidence limits for $Y$ when $X_2 = 6$ and $X_3 = 50$ are

$$94.49971 \pm 2.506258\,t$$

The point estimate for $Y$ is 94.49971, and $t$ is from a Student's t
distribution with ten degrees of freedom. For example, the 95 percent confidence
interval ($r = 0.05$; $t = 2.228$ from the 97.5 percent column of a
t-distribution table) is

(88.91577, 100.0837)


Step 7: Finally, with $Y' = 120$, when $X_3 = 50$,

$$f = 2.155922 - 0.01471392\,t^2$$
$$g = -100.7555 + 0.2242714\,t^2$$
$$h = 1177.185 - 7.097254\,t^2$$

The resulting limits for different significance levels are given in the
last two columns of Table 2. Since the point estimate of $X_2$ is 23.36715,
$Y' = 120$ lies within the indicated confidence interval for $Y$ so long as $X_2$
is between $Z_\ell$ and $Z_u$. For example, $Y' = 120$ lies within a 95 percent
confidence interval for $Y$ so long as $X_2$ is between 19.03294 and 28.80569.

TABLE 2 - LIMITS FOR X2 IN NUMERICAL EXAMPLE

      r        t        Z_ℓ         Z_u
    0.001    4.587    15.06735    36.94749
    0.01     3.169    17.39803    31.65355
    0.05     2.228    19.03294    28.80569
    0.10     1.812    19.78620    27.66996
    0.20     1.372    20.60728    26.53686
    0.50     0.700    21.91731    24.92266

Similarly, with $Y' = 120$, when $X_2 = 6$,

$$f = 0.4385757 - 0.002102684\,t^2$$
$$g = -77.63273 + 0.1987855\,t^2$$
$$h = 3435.462 - 10.96396\,t^2$$

The resulting limits at different significance levels are given in
Table 3. The point estimate of $X_3$ is 88.50550. Hence, $Y' = 120$ lies
within the confidence interval for $Y$ so long as $X_3$ is between $Z_\ell$ and $Z_u$ as
indicated in Table 3.

TABLE 3 - LIMITS FOR X3 IN NUMERICAL EXAMPLE

      r        t        Z_ℓ         Z_u
    0.001    4.587    69.75458    116.50926
    0.01     3.169    75.06417    106.11853
    0.05     2.228    78.78176    100.23987
    0.10     1.812    80.48814     97.84187
    0.20     1.372    82.34159     95.42050
    0.50     0.700    85.28138     91.92384

7.  SUMMARY

Standard  references  on  multiple  linear  regression  give  explicit 
formulas  for  determination  of  the  regression  parameters,  the  estimated 
covariance  structure  of  these  parameter  estimates,  and  interval  estimates 
for  the  dependent  variable  (about  a value  predicted  from  the  regression 
equation),  given  values  for  the  independent  variables.  When  the  regression 
is  performed  in  deviation  form,  the  constant  term  must  be  calculated  from
an  equation  which  is  generally  given  in  such  references.  However,  formulas 
for  calculation  of  the  portions  of  the  covariance  matrix  associated  with 
the  constant  term  are  generally  missing.  Such  formulas  are  derived  in 
this  report. 

In  addition,  formulas  are  derived  for  limits,  similar  to  confidence 
limits,  on  the  value  of  an  independent  variable,  given  values  for  the 
dependent  variable  and  each  of  the  other  independent  variables.  The  proper 
interpretation  of  such  limits  is  also  given. 

The  formulas  developed  here  have  been  utilized  in  a pair  of  computer 
programs,  one  a batch  program  and  the  other  an  interactive  program.  Both 

